ABSTRACT


As programming must be learned by doing, introductory 
programming course learners need to solve many problems, 
e.g., on systems such as ’Online Judges’. However, as such 
courses are often compulsory for non-Computer Science (non- 
CS) undergraduates, this may cause difficulties to learners 
that do not have the typical intrinsic motivation for pro- 
gramming as CS students do. In this sense, contextualised 
assignment lists, with programming problems related to the 
students’ major, could enhance engagement in the learning 
process. Thus, students would solve programming problems 
related to their academic context, improving their compre- 
hension of the applicability and importance of programming. 
Nonetheless, preparing these contextually personalised
programming assignments for classes from different courses is
laborious and would considerably increase the workload of
instructors and monitors. Thus, this work aims, for the first
time, to the best of our knowledge, to automatically classify
the programming assignments in Online Judges based on students'
academic contexts, by proposing a new context taxonomy, as well
as a comprehensive pipeline evaluation methodology based on
cutting-edge Natural Language Processing (NLP). Our methodology
pipeline allows for comparing state-of-the-art data augmentation
techniques, classifiers and NLP approaches. The context taxonomy
created contains 23 subject matters related to the non-CS
majors, thus representing a challenging multi-class
classification problem. We show how, even on this problem, our
pipeline evaluation methodology allows us to achieve
an accuracy of 95.2%, which makes it possible to automatically
create contextually personalised programming assignments for
non-CS students with a minimal error rate (4.8%).


Keywords 
non-CS majors, NLP, contextually personalised assignment 
lists 


1. INTRODUCTION 


Introductory Programming (often known under the label of
‘CS1’) classes are nowadays often compulsory for undergraduate
courses that do not have computing as their major
[10, 15, 20, 23]. CS1 is delivered to students majoring in,
e.g., mechanical engineering, economics, etc., whom we
collectively name here ‘non-CS students’. It is common in
such cases to find students with difficulty in interpreting as- 
signment texts, due to the lack of affinity with the area of 
the problem [22]. As a result, many of these students may 
be discouraged by CS1, as they fail to see the purpose that 
programming can have in their professional lives [10, 17, 23]. 


Moreover, programming must be learned by doing and, hence, 
learners need to solve many problems [11, 17-19, 27]. In this 
sense, ‘Online Judge’ systems can influence positively the 
learning process of non-CS students [12, 18, 20, 25], as sys- 
tems which allow students to submit programming assign- 
ments and provide real-time automatic code correction. As 
Programming Online Judges (POJ) have large numbers of 
problems registered in their problem banks [25], in principle, 
there would be plenty of problems to select from, for both 
students as well as teachers, allowing for a mass personalisa- 
tion - where one teacher could cater in parallel for the needs 
of many students. Nonetheless, the problems available on
these systems are often collected or scraped from various
environments that do not provide labelling [27], and thus
it is laborious to find appropriate problems for non-CS
students. This is even more the case as the number of
programming exercises is constantly increasing [25, 27].
Therefore, automating the categorisation of problems based
on subject matter is becoming vital to support instructors
who teach computer programming disciplines. To illustrate,
undergraduate students of Economics would be more familiar
with an if-then-else problem using terms such as “interest
rates” or “importation of goods” than with a problem on the
“growth of cells”, which may be completely out of their
comfort zone. Thus, we raise the following research question:


How can we extract the subject matter from programming 
problem statements, to automatically match programming 
assignment lists to non-CS courses? 


Our main contributions with this paper are thus: 


• Proposing a new, holistic methodology pipeline for
the POJ contextual labelling problem, which allows comparing
a variety of cutting-edge shallow and deep learning models
and experimenting with recent data augmentation techniques
(with or without augmentation), NLP representations (based
on BERT, Word2Vec, Glove), classifiers (based on BERT,
Random Forest, SVM, XGBoost, GaussianNB, GradientBoosting,
ExtraTree, Sequential DNN, CNN, RNN) and validation.


• Extracting, for the first time, to the best of our
knowledge, automatically and precisely, subject matters
related to non-CS courses; we do this by using cutting-edge
NLP techniques on the statements of assignments available
in a home-made online judge, CodeBench, used with fifteen
non-CS major programmes.


• Proposing a subject-based contextualisation taxonomy
to map subject matters to non-CS courses where CS1
is compulsory.


• Enabling, as a result, the contextual personalisation of
programming assignment lists for non-CS courses.


2. RELATED WORK 


There are many studies tackling the challenge of teaching 
introductory programming to non-CS students, based on a 
variety of angles. To illustrate, [10] employed collaborative 
scenarios to enhance teaching and learning programming 
in non-CS courses, whilst [23] used an approach involving
games and media. [15, 24] show that English-like (natural 
language) syntax can help non-CS students overcome the 
difficulties in learning programming syntax. Furthermore, 
a recent study [21] explains that effective motivational
educational design can enhance the engagement of introductory
programming students and teachers. Despite these works
representing a move towards improving non-CS students'
engagement, linking text collections to general or domain-specific
knowledge is essential [1, 5]. More specifically, [14] argue that
students’ experiences of the learning context have important 
implications for teaching and learning. Nevertheless, none of 
these aforementioned studies take the context of the problem 
into account. Especially untouched is the issue of contextu- 
alisation of the problem statements, ensuring that problems 
introduce only the degree of difficulty required to progress 
in the programming knowledge and not additional complexity
from contexts unfamiliar to the current learner (such as a
geology context for economics students, etc.).


Online judges (POJ) are increasingly being used to support 
introductory programming (CS1) classes. Via such envi- 
ronments, teachers can provide problems to be solved and 
students can submit their code and receive immediate feed- 
back [9,18, 25]. One of the issues of these systems is that, 
in general, the problems available are not categorised based 
on subject matter, topics, context, major, etc. In this sense, 
there are two recent works [3,27] which tackle the problem 
of topic extraction from such problems. In these studies, 
topic extraction is used for grouping problems in terms of 
their related programming knowledge components, concepts 
or skills. For example, a problem that can be solved by us- 
ing graph algorithms, such as breadth-first search, flood-fill 
or topological sort, can be classified into the graph category. 
Notice however that the target audience of these studies are 
more experienced POJ users. Instead, here we are not in- 
terested in categorising problems based on advanced topics. 


In fact, we tackle, for the first time, to the best of our knowl- 
edge, the challenge of extracting the subject matter from 
programming problem statements available in POJ systems 
used in introductory programming, in order to improve the 
teaching and learning process of CS1 for non-CS courses, by 
matching problems to non-CS majors. 


3. EDUCATIONAL CONTEXT 


In this paper, we use as our study base the CodeBench
Online Judge environment, which we designed and implemented
ourselves, as this gives us the freedom to add changes
inspired by our research results. Thus, we analyse here the
Introductory Programming (CS1) course run at the Federal
University of the Amazonas via this self-designed POJ, which
is delivered to 15 non-CS undergraduate degrees across the
university. These courses are divided into 5 major areas:
Mathematics, Physics, Engineering, Statistics and Geology.
Three of the degrees belong to Mathematics, 2 to
Physics, 8 to Engineering, 1 to Statistics and 1 to Geology.
Figure 1 illustrates this configuration.


As Figure 2 illustrates, during the CS1 course, students in 
our environment typically solve 7 assignment lists with prob- 
lems of increasing difficulty, using the Python programming 
language. They are allowed to solve the problems with an 
unlimited number of submission attempts, as long as they 
meet the deadline for solving all problems on a given list. 
The exercise lists always precede an exam on the same pro- 
gramming topic, both carried out in the Online Judge. Each 
list has an average of 10 questions, and the tests have 2 ques- 
tions. We call a list together with its exam a ’session’, where 
each session addresses a specific programming topic. Alto- 
gether, the course thus is formed of 7 sessions, that is, 7 
programming topics are covered during CS1. Each session 
lasts on average 2 weeks. 


During the 7 sessions, students work on the following
programming topics: Sequential, Composite conditional
structures, Chained conditional structures, Repeating structures
by condition, Repeating structures by count, Vectors and
Strings, and Matrices. Before the 7 sessions, students have a
first week to get used to the Python programming language,
where they learn about Variables and Single Operations.


Figure 1: non-CS undergraduate courses at the Federal
University of the Amazonas

Figure 2: CS1 course configuration


Whilst, in our online judge, problems are well structured 
based on these programming topics as above, they lack a 
clear division based on the contexts (here, related major ar- 
eas) in which the problems are to be delivered. Please also 
note that, although the sessions are ordered by their increas- 
ing difficulty, the topics they are addressing are somewhat 
unrelated. Moreover, this increase in difficulty is typical for 
any CS1 course, be it offline or online. 


Thus, our POJ is generic enough and is hence a good
environment in which to research approaches to automatic
classification by contexts, based on the statements, to build
context-based personalised assignment lists, towards
ultimately enhancing the engagement of non-CS students in
their learning process.


4. DATA 


The database in our Online Judge system consists of 986
programming problems in the CS1 discipline. As said, the
statements in the database were initially not categorised by
context; thus, we created a labelled corpus by manually
classifying the context of each statement, which we then
used to carry out the experiments.


As labels, we adopted in this research contexts extracted 
from Zanini and Raabe’s definitions [26], which show that 
the context of problems plays an important role for novice 
programming students. Their study manually analysed the 
contexts of 428 programming problems statements used in 
introductory programming (as in our case) offered to 51 un- 
dergraduate courses. As a result, they found 20 possible 
contexts for these problems, as follows: mathematical, com- 
mercial, person, school, human resources, research, bank- 
ing, physics, production, sport, computational, traffic, date 
and time, environment, tax, safety, consumption, popula- 
tion, others, and gamble. 


We thus started with their proposed labels to annotate our
problems. However, there were some groups of statements
that could not be mapped onto the above contexts. Moreover,
the context “others” is too general and provides no real
information. Given that, we removed the context “others”
and propose here some additional contexts, as part of our
contribution, in order to annotate our larger set of
statements. As a result of the above process, we produced a
total of 23 contexts, which we grouped together in a new
CS1 Context Taxonomy, described in Table 1. This
includes the following contexts, as contributions of our
research: Games, Movies and TV Shows, Chemistry and
Geography. In addition, the table shows the number of
statements labelled with each context and used in this
research, the description of the contexts, as well as the
undergraduate courses that may have a high connection with
each context.
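To make the intended use of the taxonomy concrete, the following is a minimal sketch of how context labels could be used to filter problems for a given course. The dictionary is only an excerpt of the course mapping in Table 1, and the function name, the problem ids and the data layout are hypothetical choices made for illustration; they are not part of CodeBench.

```python
# Excerpt of the context-to-course mapping from Table 1 (illustrative subset).
CONTEXT_TO_COURSES = {
    "Mathematical": {"Mathematics", "Engineering"},
    "Physics": {"Physics", "Engineering"},
    "Geography": {"Geology"},
    "Population": {"Statistics"},
}

# Contexts that Table 1 associates with "All courses".
ALL_COURSES_CONTEXTS = {"Traffic", "Movies and TV Shows", "Person", "Date and time"}


def problems_for_course(problems, course):
    """Return the ids of problems whose context label suits the given course.

    `problems` is a list of (problem_id, context) pairs; this layout is an
    assumption made for the sketch.
    """
    selected = []
    for problem_id, context in problems:
        suits_everyone = context in ALL_COURSES_CONTEXTS
        suits_course = course in CONTEXT_TO_COURSES.get(context, set())
        if suits_everyone or suits_course:
            selected.append(problem_id)
    return selected


# Hypothetical labelled problems.
labelled = [(1, "Physics"), (2, "Geography"), (3, "Traffic"), (4, "Population")]
geology_list = problems_for_course(labelled, "Geology")
engineering_list = problems_for_course(labelled, "Engineering")
```

With predicted rather than manual labels, the same lookup would yield a contextually personalised assignment list per course.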


It is worth noting that we performed a statistical test that
measures inter-annotator agreement, to validate that our
annotation process was conducted properly. To do so, we used
Cohen’s kappa (κ) [4], which shows the level of agreement
between two annotators on a classification task. As a result,
we achieved κ = 0.961, which is considered almost perfect
agreement [2].
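For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it in plain Python; the label lists are invented for illustration and are not taken from our corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of four statements.
annotator_1 = ["Tax", "Tax", "Games", "Games"]
annotator_2 = ["Tax", "Tax", "Games", "Tax"]
kappa = cohens_kappa(annotator_1, annotator_2)  # p_o = 0.75, p_e = 0.5
```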


5. METHODOLOGY 


Figure 3 illustrates the proposed evaluation methodology
pipeline used in the experiments of our research. We create
here a comprehensive pipeline, studying various combinations
of popular and successful state-of-the-art techniques for
natural language processing (NLP). The following subsections
explain each step of our methodology.


5.1 Data augmentation 

The data augmentation stage consists of balancing the
training data by paraphrasing it, using the pre-trained model
BERT [6]. Importantly for our task, this allows for contextual
paraphrasing. Figure 4 illustrates a paraphrasing process
based on a fragment of a statement from the category
“Computational”.


Context | Focus of the Statement | non-CS course | N
Mathematical | resolution of purely mathematical problems, without this being applied to another context | Mathematics and Engineering | 261
Commercial | handling of products, goods, such as buying and selling, calculation of commission, provision of services | Economy | 120
Games | game application, be it a virtual game or even a table game; for example, in the database there are games of naval battles, as well as video games | Digital games courses | 96
School | to solve a school problem, such as averaging, passing or failing verification | Pedagogy | 79
Traffic | related to the driver, car, mileage, accidents | All courses | 43
Sport | some activity involved with sport, such as running, football, classification | Physical education | 42
Physics | resolution of purely physical problems, without this being applied to another context | Physics and Engineering | 36
Banking | related to bank transactions, investment, balance, withdrawal, deposit, stock exchange | Economy | 35
Human Resources | problem related to human resources, such as salary calculation, data related to employees, calculation of bonuses, recruitment and selection of employees | Sociology and Psychology | 35
Movies and TV Shows | problem situation in a film or TV show; to illustrate, there are questions from the movie Harry Potter about potion calculation | All courses | 30
Population | problems on population data, such as birth rate, mortality rate, population growth; referring to either human or animal population | Statistics | 25
Chemistry | purely chemical problems, without this being applied to another context | Chemical engineering | 23
Person | problems with elements directly related to a person, like weight, height, sex | All courses | 22
Date and time | calculation of date or time, calculation of day, verification of month, conversion of hours, minutes and seconds, time interval | All courses | 21
Safety | control access, password verification, data security, encryption, validation | Software engineering | 20
Research | providing statistical data of opinion polls | Statistics and Journalism | 18
Environment | relating to environmental issues, such as pollution, temperature | Environmental engineering | 18
Health | related to issues of fighting diseases | Medicine | 17
Consumption | calculation of water, electricity or telephone-related consumption | Economy | 16
Geography | resolution of purely geographical problems, without this being applied to another context | Geology | 11
Production | related to the production of products, the quantity produced, production value, origin of the products | Production engineering | 7
Computational | computational issues, such as conversion of binary, decimal, hexadecimal numbers, ASCII table | Computer engineering | 6
Tax | calculation of taxes, such as income tax | Economy | 5


Table 1: Our proposed CS1 Context Taxonomy and Data Set description, with respective non-CS undergraduate course name and Number of items per Context, N


Figure 4 shows a newly generated sentence with clear
semantics for a human reader. Still, generated text sometimes
lacks such a clear structure. Nevertheless, our goal here is
not to generate new sentences which could be meaningful for
learners. Instead, we aim at creating artificial statements,
which are not to be presented to humans, but will be used
to expand the minority classes, providing variations to the
predictive models (see bias-variance trade-off [8]). In other
words, despite this irregularity in the semantic sense of the
statement, it is possible to perceive that the new instance
generated belongs to the same class as the statement from
which it was derived and, therefore, it may represent a useful
addition for the learning algorithm (which is later confirmed
by the results).


Figure 3: Proposed Automatic Contextualisation Research Methodology Pipeline


Original text: Write a program that prints the following
message on your computer screen: Hello world

New text (after applying paraphrase): Consider some code that
prints the above message on your machine: Hello world

Figure 4: Paraphrasing Example using BERT [6]


Nonetheless, as can be seen, despite the potential of such
contextual paraphrasing, the new statements repeat some
words from the original and keep almost the same number
of tokens, which is a limitation of this method. As such,
to prevent overtraining on artificial data (instances created
using contextual paraphrases), we limited how much the
minority classes can grow. We established this limit after
some empirical experiments: a statement is allowed to
generate at most 4 new samples in the training base, as long
as the new number of statements is below the number of
instances of the majority class. Hence, this process may not
render a perfectly balanced training base. To illustrate,
imagine that the majority class has 10 questions in the
training set, while the minority class has 1 question; with
this paraphrasing algorithm, it is possible to extend the
minority class to up to 5 questions (4 new samples + the
original statement).
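The balancing limit can be expressed as a small helper. This is a sketch under one stated assumption: a class may grow until it reaches, but never exceeds, the size of the majority class (the text only says the count must stay "below" it, so the exact boundary is our reading). The function and parameter names are illustrative.

```python
def augmented_class_size(n_original, n_majority, max_new_per_statement=4):
    """Upper bound on a class's size in the training base after paraphrasing.

    Each original statement may yield at most `max_new_per_statement`
    paraphrases. Assumption: growth stops once the class reaches the
    majority-class size.
    """
    ceiling = n_original * (1 + max_new_per_statement)
    return max(n_original, min(ceiling, n_majority))


# Example from the text: majority class of 10, minority class of 1.
size_minority = augmented_class_size(1, 10)   # 1 original + 4 paraphrases
size_medium = augmented_class_size(3, 10)     # capped by the majority class
size_majority = augmented_class_size(10, 10)  # the majority class is not augmented
```

In the first case the class remains imbalanced (5 versus 10), which is why the resulting training base is not necessarily perfectly balanced.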


In this work, experiments were carried out with and with- 
out paraphrasing, in order to analyse how the balancing by 
paraphrasing can influence the results. 


5.2 Pre-processing 

As we used reliable data (problem statements created
directly by instructors/monitors), there was no need in our
data processing to perform orthographic corrections, expand
contractions or apply other common data-cleaning steps.
However, all our problem statements were originally in the
Portuguese language. As there are many tools available for
processing text written in English, we opted to translate our
statements first into English, by using the googleTrans²
library. Subsequently, we applied our pipeline processing on
the English text obtained, with and without stop-word
removal and lemmatisation, using spacy³. As a result, we
observed empirically that these two techniques were useful
for data filtering in our pre-processing step. Next, we show
how we further prepare the text for the machine learning
algorithms.


5.3 Text Representation

The machine learning algorithms take as input a sequence
of text, in order to learn the structure of the text, just like
a human does. However, we need to convert the data into
numerical form. As such, we represent our text data as a
sequence of numbers (see the Keras Tokenizer function⁴).
Moreover, the ML algorithms expect each training instance to
have the same length (same number of tokens). Thus, we
padded with zeros, at the end, the sequences that are shorter
than the maximum-length sequence. To do so, we applied the
Keras padding module⁵ over the sequences.


² pypi.org/project/googletrans/
³ spacy.io
⁴ keras.io/preprocessing/text/
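The conversion of statements into equal-length integer sequences can be sketched without any dependency. The code below is a plain-Python stand-in that mimics the idea of the Keras tokeniser and padding utilities (words mapped to integer indices, index 0 reserved for padding, zeros appended at the end); it is not the Keras implementation, and the function names and example statements are ours.

```python
from collections import Counter


def fit_vocabulary(texts):
    """Map each word to an integer index, most frequent words first.

    Index 0 is reserved for padding, so word indices start at 1.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_padded_sequences(texts, vocabulary):
    """Encode texts as index sequences and zero-pad them at the end."""
    sequences = [
        [vocabulary[w] for w in text.lower().split() if w in vocabulary]
        for text in texts
    ]
    max_len = max((len(s) for s in sequences), default=0)
    return [s + [0] * (max_len - len(s)) for s in sequences]


# Hypothetical (already translated) statement fragments.
statements = ["print hello world", "print sum"]
vocab = fit_vocabulary(statements)
padded = texts_to_padded_sequences(statements, vocab)
```

Here the shorter statement receives one trailing zero, so both rows have the length of the longest sequence.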


In addition, two different state-of-the-art NLP techniques
for vector representation of words are used to compete
against each other: googleNews-Vectors (W2V)⁶ and Glove
[16] word embeddings. Moreover, for the BERT classifier,
we used its own layer of word embeddings. Similarly, for the
other deep learning models, we used the word embedding
layers as provided by the Keras library⁷. The purpose of
this step is to compare the NLP techniques in terms of
performance on our data set. Therefore, we created a process
to obtain the best model for the automatic categorisation of
contexts of programming questions for our educational context.
The process thus allowed us to carry out experiments with
advanced Deep Learning methods, and to compare those
approaches not only with each other, but also with classical
approaches, such as shallow learning models.


5.4 Classifiers 


For deep learning models we used: a) Convolutional Neural
Networks (CNN), which have a convolutional layer followed
by three dense layers; b) Recurrent Neural Networks (RNN),
with a recurrent layer using Long Short-Term Memory (LSTM)
followed by three dense layers; c) RNN and CNN (RNN+CNN)
stacked, with the same configurations as those of items a
and b; d) Sequential Neural Network (SNN) with two dense
layers; and e) BERT for classification (notice that we used
BERT for two purposes: i) to perform contextual paraphrasing;
ii) multi-class classification).


As we are tackling a multi-classification problem, the final 
layer for each neural network was represented by a softmax 
layer [13]. For all deep learning models, the configurations 
used above represent the default recommended ones from 
the literature [13]. 
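As a reminder of what the final layer computes, softmax turns the raw scores (logits) of the last dense layer into a probability distribution over the classes, softmax(z)_i = exp(z_i) / Σ_j exp(z_j), and the predicted context is the class with the highest probability. The sketch below uses three invented logits instead of the 23 contexts for brevity.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three contexts.
probs = softmax([2.0, 1.0, 0.1])
predicted = probs.index(max(probs))  # index of the most probable context
```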


Additionally, we used the following classical, shallow
classifiers, with the word embeddings from googleNews-Vectors
and Glove: Random Forest Classifier (RFC), Support Vector
Machine (SVM), Extremely Randomised Tree Classifier
(ETC), Gaussian Naive Bayes (GNB), XGBoost (XGB) and
Gradient Boosting Classifier (GBC).


5.5 Validation 

To validate the models, we employed stratified k-fold
cross-validation with k = 10 folds. This method divides the
base into k partitions, using k − 1 for training and 1 for
testing. After that, the accuracy on the test partition is
calculated. This process is repeated k times, until all
partitions have been used as a test set. Finally, the average
of the accuracy obtained in the tests is computed. It is
noteworthy that each fold was divided proportionally to the
number of statements present in each class in the database
[13]. We implemented it using StratifiedKFold from
scikit-learn. Notice that we performed the data augmentation
only on the training set of each fold. Thus, there were no
paraphrased texts in the test sets.
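The interplay between stratified folds and training-only augmentation can be sketched as follows. This is a dependency-free illustration rather than the scikit-learn StratifiedKFold code used in the experiments; `augment` and `train_and_score` are hypothetical callables standing in for the paraphrasing step and for model training plus scoring.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Split sample indices into k folds, spreading each class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        for idx in by_class[label]:
            folds[position % k].append(idx)  # round-robin keeps proportions
            position += 1
    return folds


def cross_validate(texts, labels, k, augment, train_and_score):
    """Average the test score over k folds, augmenting the training data only."""
    scores = []
    for test_idx in stratified_folds(labels, k):
        held_out = set(test_idx)
        train = [(texts[i], labels[i]) for i in range(len(texts)) if i not in held_out]
        test = [(texts[i], labels[i]) for i in test_idx]
        train = augment(train)  # paraphrases never reach the test partition
        scores.append(train_and_score(train, test))
    return sum(scores) / k


# Toy data and stand-in callables (all hypothetical).
texts = [f"statement {i}" for i in range(30)]
labels = ["Mathematical"] * 20 + ["Tax"] * 10


def fake_paraphrase(train):
    return train + [("PARAPHRASED " + t, y) for t, y in train if y == "Tax"]


def fake_score(train, test):
    assert all(not t.startswith("PARAPHRASED") for t, _ in test)
    return 1.0


mean_score = cross_validate(texts, labels, 5, fake_paraphrase, fake_score)
```

Because augmentation is applied inside the loop, after the split, no paraphrase of a test statement can leak into the data the model is evaluated on.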


⁵ keras.io/preprocessing/sequence/
⁶ code.google.com/archive/p/word2vec/
⁷ keras.io/


To evaluate our models, we used the F1-score, as this metric
combines precision and recall in a harmonic mean. This is
useful because the harmonic mean gives much more weight to
low values than a regular mean, which treats all values
equally. Moreover, we used the weighted F1-score, which takes
into account the proportion of each class.
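For clarity, the per-class F1-score is F1 = 2 · P · R / (P + R), where P is precision and R is recall; the macro average weights all classes equally, whereas the weighted average weights each class by its number of true instances. The sketch below computes both in plain Python on invented labels.

```python
def f1_report(y_true, y_pred):
    """Return (macro F1, weighted F1) for a multi-class prediction."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append((f1, tp + fn))  # (F1, support of the class)
    macro = sum(f1 for f1, _ in per_class) / len(per_class)
    weighted = sum(f1 * support for f1, support in per_class) / len(y_true)
    return macro, weighted


# Hypothetical predictions: class "a" has F1 = 0.8, class "b" has F1 = 2/3.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
macro_f1, weighted_f1 = f1_report(y_true, y_pred)
```

In this example the weighted F1 is higher than the macro F1 because the larger class is also the better-classified one.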


6. RESULTS AND DISCUSSION 


We built a total of 34 predictive models. Figure 5
illustrates the results obtained by all models applied in this
research. From this figure, we can notice that paraphrasing
improved the (weighted) F1-score in all models. To illustrate
this boost, the model GLOVE + SVM achieved an F1-score of
86% without paraphrasing, whereas with paraphrasing the
model achieved 94%, an increase of 8 percentage points.
To validate that, we performed McNemar’s statistical
hypothesis test, which is recommended for comparing machine
learning models [7]. We compared the models with and
without the contextual paraphrasing. As a result, we confirmed
that the paraphrasing statistically boosts all models, even
after Bonferroni correction (p-values < 0.05/2). Table 2
shows the classification performance of the models in terms
of macro and weighted precision, recall and F1-score.
Moreover, this table shows the accuracy of each model.


From a visual inspection of Figure 5, we can argue that
the best model found is the BERT classifier with use of the
contextual paraphrasing (BERT + PAR), as the model has
the highest median and a low standard deviation. Moreover,
this model achieved the highest recall, F1-score and accuracy
(Table 2). To validate that, we also performed McNemar’s
test. As a result, we confirmed our previous deduction, as
BERT + PAR statistically outperforms all the other models,
even after Bonferroni correction (p-values < 0.05/33). As
such, in Figure 6, we show the performance of this model
for each context, as a heat-map plot. The rows represent
the actual values, while the columns depict the predicted
contexts.
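The statistical comparison can be illustrated with a short sketch. McNemar's test only looks at the test items on which two models disagree; the code below implements the exact (binomial) variant, which is an assumption on our part, since the variant used is not specified above. The disagreement counts in the example are invented and are not results from our experiments.

```python
from math import comb


def mcnemar_exact_p(only_a_correct, only_b_correct):
    """Two-sided exact McNemar p-value from the two disagreement counts."""
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0  # the models never disagree
    k = min(only_a_correct, only_b_correct)
    # Under the null hypothesis, disagreements follow Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def significant_after_bonferroni(p_value, n_comparisons, alpha=0.05):
    """Compare the p-value with the Bonferroni-corrected threshold."""
    return p_value < alpha / n_comparisons


# Hypothetical counts: model A alone is right on 1 item, model B alone on 15.
p_value = mcnemar_exact_p(1, 15)
is_significant = significant_after_bonferroni(p_value, n_comparisons=33)
```

With 33 comparisons the corrected threshold is 0.05/33 ≈ 0.0015, so only clearly asymmetric disagreement counts remain significant.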


Figure 6 illustrates that, in general, our best model is
capable of recognising problems from each context with a high
recall. Indeed, there are predictions in some classes without
misclassification, such as Computational, Sport, etc.
However, we can see some cases where the model made
mistakes. For example, the model confuses the classes
Production and Commercial. This may have happened because
some problem statements could have come from a production
context, but with a focus on sales, which would be more
related to the Commercial context. Moreover, there are some
problems that are actually from the context Production but
were classified by our best model as Date and time. This was
an unexpected result for us. After visual inspection, we
noticed that some of these problems linked the efficiency of a
company to the time-scale (e.g., how long a process took
determined its efficiency). This is a possible explanation for
such confusions within our model.


Coupled with that, according to Table 1, it is possible to
notice that the class Computational has only a few
statements. Despite the low number of problems in this context,
our model is able to recognise this minority class with no
errors (100% precision and recall). Still, the class Tax
presents the lowest number of problems in our database.
But even so, our model achieved a recall of 80% in this
context. Note that the instances misclassified from the Tax
context were allocated to Commercial, which makes sense,
as, in some cases, these two contexts are related.


Figure 5: All results (F1-score)


Although the model achieved a high recall (95%) in the
context of Games, instances that the model was not able
to recognise were spread through multiple contexts
(Commercial, Date and time, Physics, Tax, and Mathematical).
The 2% error between Games and Tax can be explained by
statements of games that comprise tariffs, e.g., when buying
a certain product within the game. For example, there
are statements in our data set that discuss buying products
for a character, such as a battle suit. Further, the error
of 1% with the class Commercial could be due to a reason
similar to that of the class Tax. To illustrate, within a
game, some statements comprise the purchase of products.
Regarding the class Date and time, an explanation would
be statements that address some mission that the character
needs to accomplish in a specific time. Regarding the error
in the classes Physics and Mathematical, it may be due to
statements in games that contain speed calculation.


Another important analysis to be done occurs in the class 
Research. The model achieved a recall of 94%, whereas 6% 
of errors occurred in the class Person. One possible reason is 
that surveys are conducted based on a group of people. Also, 
there are statements in our database that contain research 
carried out on some characteristics of people, such as age 
group, education, etc. 


Another interesting outcome relates to the following classes:
Banking and Commercial. Note that both presented confusion
errors between each other, that is, the class Banking
presented wrong predictions in the class Commercial and
vice versa. This can be explained by the fact that both
classes deal with statements that involve money.


Furthermore, a similar situation occurs for the classes Health
and Population. Here, errors could be due to statements
addressing, e.g., the growth of a virus or bacteria. Thus,
results may highlight relations between these contexts.


Another interesting analysis relates to the majority class of 
our data set, that is, the class Mathematical. Note that it 
was possible to obtain here a 99% recall. Even more impor- 
tantly, note that few classes have errors in this class, that is, 
although we are dealing with the majority class, our model 
can differentiate, with high precision, all classes, against this 
one. To illustrate, only the following classes had a confu- 
sion error with respect to this class: Games, Geography and 
Commercial. Regarding the error presented in the predic- 
tion of the class Games, it is an error that could be justified 
by questions that deal with any type of calculation, given 
that any form of calculation can be directly related to the 
mathematical context. For the Geography class, the error 
could be justified, as we have noticed the existence of state- 
ments that deal with map scale conversion. Regarding the 
class Commercial, the error could be justified by calculating 
the price of a certain product. 


Nevertheless, we had unexpected outcomes as well. For ex- 
ample, it was arguably to be expected that the Physics class 
presented errors in the Mathematical class, given that state- 
ments that address a physical contextualisation deal with 
mathematical calculations. However, this does not happen. 
Thus, our model clearly differentiates here between even 
small details present in the statement of each context. 


In other words, although there is an error in the classification 
of some instances in the classes, most of these errors can 
be easily justified. This may suggest that the statements 
worked on in this research have multi-contextualisation, that 
is, a statement can address more than one context. However, 
what happens in practice is that one context is predominant, 
and the prediction of our model reflects this. Still, it is 
potentially useful to further analyse this problem as a 
multi-contextual prediction task. 


Table 2: Classification performance of the predictive models (Pr: precision; Re: recall; F1: F1-score; Acc: accuracy). 

Model | Pr(Macro) | Pr(weighted) | Re(Macro) | Re(weighted) | F1(Macro) | F1(weighted) | Acc 
GLOVE+RFC | 95% | 88% | 78% | 87% | 84% | 87% | 86.8% 
GLOVE+RFC+PAR | 94% | 92% | 86% | 92% | 89% | 91% | 91.6% 
GLOVE+ETC | 95% | 90% | 80% | 88% | 86% | 88% | 88.3% 
GLOVE+ETC+PAR | 95% | 93% | 87% | 92% | 90% | 92% | 92.3% 
GLOVE+XGBC | 91% | 87% | 74% | 87% | 79% | 86% | 86.5% 
GLOVE+XGBC+PAR | 92% | 91% | 82% | 90% | 85% | 90% | 90.2% 
GLOVE+GNB | 90% | 86% | 78% | 85% | 82% | 85% | 85.0% 
GLOVE+GNB+PAR | 88% | 87% | 82% | 86% | 84% | 86% | 86.7% 
GLOVE+SVM | 80% | 87% | 71% | 87% | 75% | 86% | 86.8% 
GLOVE+SVM+PAR | 95% | 94% | 89% | 94% | 91% | 94% | 93.7% 
GLOVE+GBC | 79% | 83% | 68% | 82% | 72% | 81% | 81.7% 
GLOVE+GBC+PAR | 83% | 87% | 77% | 87% | 79% | 86% | 86.6% 
GLOVE+KC | 91% | 93% | 89% | 93% | 90% | 93% | 92.8% 
GLOVE+KC+PAR | 91% | 93% | 90% | 93% | 90% | 93% | 93.1% 
W2V+RFC | 95% | 88% | 78% | 86% | 84% | 86% | 86.1% 
W2V+RFC+PAR | 94% | 93% | 87% | 93% | 90% | 93% | 92.7% 
W2V+ETC | 95% | 88% | 79% | 87% | 85% | 87% | 86.8% 
W2V+ETC+PAR | 95% | 93% | 87% | 92% | 90% | 92% | 92.3% 
W2V+XGBC | 92% | 87% | 76% | 86% | 82% | 86% | 86.4% 
W2V+XGBC+PAR | 91% | 91% | 86% | 91% | 87% | 91% | 90.7% 
W2V+GNB | 90% | 87% | 78% | 86% | 82% | 86% | 85.7% 
W2V+GNB+PAR | 88% | 87% | 81% | 86% | 84% | 86% | 85.9% 
W2V+SVM | 85% | 90% | 79% | 90% | 82% | 90% | 90.2% 
W2V+SVM+PAR | 96% | 95% | 91% | 94% | 93% | 94% | 94.3% 
W2V+GBC | 77% | 82% | 69% | 81% | 73% | 81% | 81.3% 
W2V+GBC+PAR | 83% | 88% | 78% | 88% | 80% | 88% | 87.8% 
W2V+KC | 91% | 92% | 89% | 92% | 90% | 92% | 92.4% 
W2V+KC+PAR | 93% | 94% | 91% | 94% | 92% | 94% | 93.9% 
KT+CNN | 94% | 91% | 84% | 91% | 88% | 91% | 90.8% 
KT+CNN+PAR | 92% | 93% | 90% | 93% | 91% | 93% | 93.2% 
KT+(RNN+CNN) | 86% | 91% | 86% | 91% | 85% | 91% | 90.8% 
KT+(RNN+CNN)+PAR | 89% | 91% | 87% | 91% | 88% | 91% | 91.4% 
BT+BERT | 93% | 95% | 91% | 95% | 92% | 95% | 94.7% 
BT+BERT+PAR | 94% | 95% | 92% | 95% | 93% | 95% | 95.2% 
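Table 2 reports both macro and weighted averages. As a reading aid, the sketch below shows how the two differ for recall on an imbalanced toy example: the macro average gives every class the same weight, while the weighted average weights each class by its number of instances (and, for recall, coincides with accuracy). The helper names and labels are assumptions made for illustration, not the evaluation code of this work.

```python
def per_class_recall(y_true, y_pred):
    """Map each class to a (recall, support) pair."""
    scores = {}
    for c in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        scores[c] = (hits / support, support)
    return scores


def macro_average(scores):
    """Unweighted mean over classes: minority classes count as much as large ones."""
    return sum(s for s, _ in scores.values()) / len(scores)


def weighted_average(scores):
    """Mean over classes weighted by support (number of true instances)."""
    total = sum(n for _, n in scores.values())
    return sum(s * n for s, n in scores.values()) / total


# Toy imbalanced data, invented for illustration only.
y_true = ["Mathematical"] * 8 + ["Physics"] * 2
y_pred = ["Mathematical"] * 8 + ["Physics", "Mathematical"]

scores = per_class_recall(y_true, y_pred)
macro = macro_average(scores)
weighted = weighted_average(scores)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(macro, weighted, accuracy)
```

A macro recall well below the weighted recall, as in several rows without paraphrasing, therefore points to weaker performance on the smaller classes.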


7. LIMITATIONS 


One of the major limitations of this paper is related to data 
set size. Although we have a significant number of problems, 
some contexts have only a small number of instances, owing 
to the number of classes in our multi-classification 
problem. To address this limitation, we used cutting-edge 
NLP techniques to produce new instances for the training 
set, using contextual paraphrases. 
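The idea of balancing the training set with paraphrases can be sketched as follows. This is only an illustration under stated assumptions: `paraphrase` is a hypothetical callable standing in for a paraphrase-generation model, and topping every class up to the size of the largest class is an assumed policy, not necessarily the one used in this work.

```python
from collections import defaultdict


def balance_with_paraphrases(texts, labels, paraphrase, target=None):
    """Return the training set extended with paraphrases of minority-class items.

    `paraphrase(text, k)` is a hypothetical callable returning the k-th variant
    of `text`. Apply this to the training split only, never to the test split.
    """
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)
    if target is None:
        # Assumed policy: grow every class to the size of the largest one.
        target = max(len(items) for items in by_class.values())

    out_texts, out_labels = list(texts), list(labels)
    for label, items in by_class.items():
        missing = max(target - len(items), 0)
        for k in range(missing):
            source = items[k % len(items)]  # cycle through the originals
            out_texts.append(paraphrase(source, k))
            out_labels.append(label)
    return out_texts, out_labels


def toy_paraphrase(text, k):
    # Placeholder only: a real system would call a paraphrasing model here.
    return f"{text} (variant {k + 1})"


texts = ["m1", "m2", "m3", "p1"]
labels = ["Mathematical"] * 3 + ["Physics"]
new_texts, new_labels = balance_with_paraphrases(texts, labels, toy_paraphrase)
print(new_texts, new_labels)
```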


Moreover, our original problem descriptions were in 
Portuguese, and their translation to English may have 
introduced some errors into our automatic data processing. 
However, this was counterbalanced by the availability of 
cutting-edge NLP tools for the various steps involved in our 
pipeline, which were not available for the Portuguese 
language. 


In addition, this research worked with introductory computer 
programming topics. It is thus less clear whether the 
methodology applies to more advanced programming topics. For 
example, database disciplines may need a different approach. 
However, the holistic pipeline we propose compares 
alternative methods side by side, so the best-performing 
method can be identified for each area. 


Another limitation arises from undergraduate courses that 
do not have programming in their curriculum. Although 
several of the courses considered in this research may use 
programming for some activities, not all of them have 
programming topics in the curriculum. To illustrate, 
although our data set contains health-related problems that 
could be applied to medical or nursing courses, these 
undergraduate courses do not have programming topics in 
their curriculum. This may, however, change in the future 
with the growing ubiquity of computing, and thus this 
research may have wider relevance and impact than originally 
envisioned. 
Figure 6: BERT with Paraphrasing 


8. CONCLUSION AND FUTURE WORK 


From the results obtained and illustrated in this research, 
we can conclude that paraphrasing the minority classes 
boosts results: it made the predictive models more accurate 
and gave them greater recognition capacity, regardless of 
the NLP approach used (e.g., GloVe, Word2vec or BERT). 


In addition, our work achieved high precision and a high 
recognition rate for all 23 classes proposed in this 
article. Our best model, which is based on BERT with 
paraphrase-balancing, achieved an accuracy of 95.2%, that 
is, an error rate of only 4.8%. 


With that, the first step towards generating personalised 
problem lists, according to the context of the undergraduate 
course, has been taken. We have additionally provided a new 
context taxonomy for problems, as well as a comprehensive 
evaluation pipeline methodology for context-based 
personalisation of problem lists. 
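Once each problem has a predicted context, a personalised list could be assembled by filtering on the contexts relevant to a given major. The sketch below is hypothetical: the mapping from majors to contexts, the problem identifiers and the function name are assumptions made for illustration and are not part of the taxonomy or tooling described in this work.

```python
def personalised_list(problems, predicted_context, major_contexts, major, size=10):
    """Select up to `size` problems whose predicted context suits `major`.

    `predicted_context` maps a problem id to the context predicted by a
    classifier; `major_contexts` maps a major to a set of relevant contexts.
    Both mappings are hypothetical inputs.
    """
    wanted = major_contexts.get(major, set())
    selected = [p for p in problems if predicted_context[p] in wanted]
    return selected[:size]


# Toy data, invented for illustration only.
problems = ["p1", "p2", "p3", "p4"]
predicted_context = {
    "p1": "Physics",
    "p2": "Games",
    "p3": "Mathematical",
    "p4": "Physics",
}
major_contexts = {"Physics degree": {"Physics", "Mathematical"}}

chosen = personalised_list(problems, predicted_context, major_contexts,
                           "Physics degree", size=2)
print(chosen)
```

In practice an instructor would still review the selected problems, since a small share of predictions is expected to be wrong.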


As future work, we intend to evaluate the effect of 
personalised programming problem assignments built with our 
method for detecting the subject matter. We can thus explore 
whether the performance of non-CS students is affected 
when they solve problems related to their courses. 


In addition, three new experiments can be performed to 
analyse the generalisation power of our method. The first is 
to repeat the procedure on other online judge problem col- 
lections, but still at an introductory programming discipline 
level. The purpose of this experiment is to verify how gener- 
alisable our approach is across educational settings different 
from ours. We believe, nevertheless, that choices such as the 
programming language used in teaching CS1 will not 
prevent similar outcomes. 


As a second experiment, we would repeat the procedure 
with more advanced programming topics, to analyse whether 
the method can be applied to these more complex topics. For 
example, disciplines such as data structures may be a 
research target. Finally, we envision adapting our pipeline 
to automatically classify programming problems in terms of 
the topics used in CS1 courses (Sequential, Composite 
conditional structures, Chained conditional structures, 
Repeating structures by condition, Repeating structures by 
counting, Vectors and Strings, and Matrices). Such a 
pipeline would be useful for several applications, such as 
problem recommendation and automatic annotation, amongst 
others. 


In conclusion, we believe that automating the classification 
of statements by context is extremely relevant for several 
reasons, among which we highlight: i) statements on subjects 
that students are already familiar with can help engagement 
and learning; ii) students will find it easier to understand 
the relevance of programming in their professional lives; 
iii) teachers can use this automation to generate 
personalised lists, which would facilitate their work, since 
selecting these problems manually is laborious and prone to 
human error; and iv) students could use this automation to 
select problems in contexts they are used to, facilitating 
their learning of a given programming topic. 